NLP Preprocessing
Here, we will perform pre-processing of documents using nltk, an important step before the feature-extraction phase
Downloading important resources
Before proceeding further, we need to install the nltk library in our system using the command
pip install nltk
!pip install nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
...
Successfully installed nltk-3.8.1
# After installing, we import nltk and download the resources (corpora, models, etc.)
import nltk
nltk.download('all')
[nltk_data] | Downloading package abc to
[nltk_data] | /home/datalore/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to
[nltk_data] | /home/datalore/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
...
[nltk_data] Done downloading collection all
Alternatively, calling nltk.download() without any arguments opens a pop-up window, where we can select All to download every package and corpus provided by nltk
Get a raw document from the nltk corpora
We will retrieve a sample document from the nltk corpora; here we use the state_union corpus
from nltk.corpus import state_union
sample_document = state_union.raw("2006-GWBush.txt")
sample_document[:100]
Stop-word Removal
Here we want to filter unnecessary words out of the document, including determiners, articles, and other function words. First we tokenize the document into words, and then remove the stop words
# Word_tokenize the document
from nltk.tokenize import word_tokenize
words = word_tokenize(sample_document)
# Print the first 15 words
print(words[:15])
# Stop-word removal
from nltk.corpus import stopwords
# Set of english stopwords
stop_words = set(stopwords.words("english"))
# Filter by stop_words
words = [w for w in words if w.lower() not in stop_words]
# Print the first 15 words
print(words[:15])
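To make the filtering logic concrete, here is a minimal self-contained sketch; it uses a tiny hand-picked stop list instead of nltk's full English list, so the exact words are illustrative only:

```python
# Minimal stop-word filter sketch; the stop list here is a tiny
# hand-picked subset, not nltk's full English list
stop_words = {"the", "is", "a", "of", "and", "to"}

def remove_stopwords(tokens):
    # Compare case-insensitively, as we did with the nltk stop words
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords("The state of the union is strong".split()))
# ['state', 'union', 'strong']
```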
Stemming
After stop-word removal, we perform stemming, which reduces each word to its root form (stem), without any consideration for meaning
# For stemming we will use the PorterStemmer class provided by nltk
from nltk.stem import PorterStemmer
ps = PorterStemmer()
word_stems = []
for word in words:
    word_stem = ps.stem(word)
    word_stems.append(word_stem)
# Printing the first 15 stems
print(word_stems[:15])
As we can see above, the word president is reduced to its stem presid
Lemmatization
Unlike stemming, lemmatization reduces a word to its root form (lemma) such that the result is itself a meaningful word.
# For lemmatization we will use the WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word_lemmas=[]
for word in words:
    word_lemma = lemmatizer.lemmatize(word)
    word_lemmas.append(word_lemma)
# Printing the first 15 lemmas
print(word_lemmas[:15])
Here, unlike with stemming, the meaning of each lemma stays intact
Part of Speech Tagging
To maintain context throughout, it is necessary to tag words with their parts of speech, so that we can later filter them by tag for analysis
pos_tags = nltk.pos_tag(words)
# Print the first 15 words by their tags
print(pos_tags[:15])
The tags and their meanings are described below
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, singular present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
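With tags in hand, a tagged word list can be filtered by tag prefix, for example to keep only nouns. A small sketch over a hand-made tagged list (the pairs have the same shape as the output of nltk.pos_tag):

```python
# Keep only words whose tag starts with a given prefix;
# 'NN' covers NN, NNS, NNP, and NNPS
def filter_by_tag(tagged_words, prefix="NN"):
    return [word for word, tag in tagged_words if tag.startswith(prefix)]

tagged = [("America", "NNP"), ("must", "MD"), ("lead", "VB"), ("nation", "NN")]
print(filter_by_tag(tagged))        # ['America', 'nation']
print(filter_by_tag(tagged, "VB"))  # ['lead']
```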
Named-Entity Recognition
Apart from tagging words with their parts of speech, we can also group words based on context and meaning, e.g. names of persons, dates, locations, and organizations
namedEnt = nltk.ne_chunk(pos_tags)
# Print the first 15 chunks
print(namedEnt[:15])
To get more clarity, we can use namedEnt.draw() to display the tree structure of the named entities
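Instead of inspecting the tree visually, the entities can also be collected programmatically: in the tree returned by nltk.ne_chunk, named entities appear as labelled subtrees, while ordinary words remain plain (word, tag) tuples. A sketch:

```python
# Collect (entity text, label) pairs from an nltk.ne_chunk result;
# labelled subtrees are entities, bare (word, tag) tuples are not
def extract_entities(tree):
    entities = []
    for node in tree:
        if hasattr(node, "label"):  # a Tree node, i.e. a named entity
            name = " ".join(word for word, tag in node.leaves())
            entities.append((name, node.label()))
    return entities
```

For example, extract_entities(namedEnt) would yield pairs such as ('George', 'PERSON'), depending on how the chunker labels each span.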
Word embeddings
Step 1:
- We import numpy, which we will use to build the one-hot vectors
import numpy as np
Step 2:
- We take the set of unique words in our dataset
unique_words = set()
for sentence in sample_document.split('\n'):
    for word in sentence.split():
        unique_words.add(word.lower())
Step 3:
- We create a dictionary to map each unique word to an index
word_to_index = {}
for i, word in enumerate(unique_words):
    word_to_index[word] = i
Step 4:
- We initialize a 2D matrix of all 0s for our entire set of one-hot vectors
one_hot_vectors = np.zeros((len(unique_words), len(unique_words)))
Step 5:
- We set the exact position to 1 for each word as per our mapping method
for word in unique_words:
    idx = word_to_index[word]
    one_hot_vectors[idx, idx] = 1
one_hot_vectors[word_to_index['taxpayer']]
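Since row i gets its single 1 in column i, the stack of one-hot vectors is exactly an identity matrix, so the whole construction collapses to np.eye. A sketch with a toy three-word vocabulary (in the notebook, the vocabulary is the unique_words set):

```python
import numpy as np

# Toy vocabulary standing in for the unique_words set
vocab = ["economy", "freedom", "taxpayer"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# The matrix of one-hot vectors is just the identity matrix
one_hot_vectors = np.eye(len(vocab))
print(one_hot_vectors[word_to_index["taxpayer"]])  # [0. 0. 1.]
```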
Bag of Words
Step 1:
- We will first preprocess the data, in order to:
- Convert text to lower case.
- Remove all non-word characters.
- Remove all punctuations.
import re
dataset = nltk.sent_tokenize(sample_document)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()
    dataset[i] = re.sub(r'\W', ' ', dataset[i])
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])
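The effect of the two substitutions can be seen on a single sentence (a sketch; \W also matches punctuation, which is why the whitespace-collapsing pass comes second):

```python
import re

def clean(sentence):
    s = sentence.lower()
    s = re.sub(r'\W', ' ', s)   # replace every non-word character with a space
    s = re.sub(r'\s+', ' ', s)  # collapse runs of whitespace into one space
    return s

print(clean("Thank you all.  (Applause.)"))  # 'thank you all applause '
```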
Step 2: Obtaining most frequent words in our text.
We will apply the following steps to generate our model:
- We declare a dictionary to hold our bag of words.
- Next we tokenize each sentence to words.
- Now for each word in sentence, we check if the word exists in our dictionary.
- If it does, then we increment its count by 1. If it doesn’t, we add it to our dictionary and set its count as 1.
word2count = {}
for data in sample_document.split(' '):
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
Sample count:
word2count['taxpayer']
The entire BoW for our dataset:
word2count
{'GEORGE': 1,
 'W': 5,
 '.': 349,
 'BUSH': 1,
 "'S": 1,
 'ADDRESS': 1,
 ...
 'of': 186,
 'the': 260,
 'and': 239,
 ...}
'principles': 1,
'willing': 1,
'dramatically': 2,
'dangerous': 2,
'anxious': 1,
'values': 2,
'gave': 1,
'birth': 1,
'Roosevelt': 1,
'Truman': 1,
'Kennedy': 1,
'Reagan': 1,
'rejected': 1,
'isolation': 1,
'knew': 1,
'when': 4,
'march': 1,
'generation': 3,
'war': 2,
'determined': 2,
'fought': 1,
'steady': 1,
'bipartisan': 2,
'tonight': 4,
'yours': 1,
'Together': 1,
'defend': 1,
'strengthening': 1,
'healthy': 1,
'vigorous': 1,
'growing': 2,
'faster': 1,
'major': 1,
'industrialized': 1,
'two-and-a-half': 1,
'created': 1,
'4.6': 1,
'million': 3,
'jobs': 6,
'Japan': 1,
'European': 1,
'combined': 1,
'Even': 1,
'face': 2,
'higher': 1,
'energy': 5,
'prices': 1,
'natural': 2,
...}
First, we create the list of documents, treating each line of the sample text as one document:
docs = []
for sentence in sample_document.split('\n'):
docs.append(sentence)
We import scikit-learn's TfidfVectorizer for fast IDF calculation.
from sklearn.feature_extraction.text import TfidfVectorizer
After that, we initialize a vectorizer and apply its fit_transform() method to our documents.
tfidf = TfidfVectorizer()
result = tfidf.fit_transform(docs)
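As a quick sanity check, `fit_transform` returns a sparse matrix with one row per document and one column per vocabulary term. A tiny toy corpus (not our speech text) makes this easy to see:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = TfidfVectorizer()
mat = vec.fit_transform(toy_docs)
# one row per document, one column per unique term
print(mat.shape)                       # (3, 5)
print(sorted(vec.vocabulary_.keys()))  # ['cat', 'dog', 'ran', 'sat', 'the']
```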
We define two auxiliary functions:
def get_tf(doc):
    tf = {}
    doc = doc.lower()
    doc_lst = doc.split(' ')
    s = len(doc_lst)
    # count occurrences of each token
    for tkn in doc_lst:
        if tkn not in tf:
            tf[tkn] = 1
        else:
            tf[tkn] += 1
    # normalize counts by document length
    tf = {tkn: t/s for tkn, t in tf.items()}
    return tf
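For example, in a five-token string the token 'the' appears twice, so its term frequency is 2/5. The sketch below restates the same counting logic so it runs standalone (hypothetical input string):

```python
def tf_of(doc):
    # count tokens, then normalize by document length
    tokens = doc.lower().split(' ')
    counts = {}
    for tkn in tokens:
        counts[tkn] = counts.get(tkn, 0) + 1
    return {tkn: c / len(tokens) for tkn, c in counts.items()}

result = tf_of("the cat and the hat")
print(result['the'])  # 0.4
```

By construction, the term frequencies of a document always sum to 1.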
def get_tfidf(vectorizer, doc):
    tf = get_tf(doc)
    # map each token to its column index in the fitted vocabulary
    vocab = vectorizer.vocabulary_
    print("\ntf-idf values:")
    for tkn, t in tf.items():
        if tkn in vocab:
            print(tkn, ':', t * vectorizer.idf_[vocab[tkn]])
Finally, we print the tf-idf values for the first document.
d = docs[0]
print(f'Document: {d}')
get_tfidf(tfidf, d)
tf-idf values:
president : 0.21574461603222617
george : 0.23172695339065846
address : 0.2231630267335885
before : 0.23172695339065846
joint : 0.29276096942777563
session : 0.29276096942777563
of : 0.20044017187780394
the : 0.29553164803891363
congress : 0.18082191273098316
on : 0.16483957537255092
state : 0.20920111405131597
union : 0.20920111405131597
Word2vec
Word2vec is a neural network based language model that converts words into vectors (embeddings) in a high-dimensional space.
This model was first introduced in this arXiv paper.
Here, however, we follow ch. 6 of Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin and implement the skip-gram with negative sampling (SGNS) method.
Link here.
First, we import the required modules.
import random
import numpy as np
# create vocabulary
vocab = list(set(words))
# a lookup table (word -> integer) simplifies numpy array indexing
word2idx = {word:idx for idx, word in enumerate(vocab)}
VOCAB_SIZE = len(vocab)
DOC_SIZE = len(words)
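On a hypothetical toy input, the lookup table assigns each unique word a stable integer index, which is exactly what we need to index rows of an embedding matrix:

```python
toy_words = ['we', 'the', 'people', 'of', 'the', 'union']
toy_vocab = list(set(toy_words))
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}
# duplicates collapse: 6 tokens, 5 unique words
print(len(toy_words), len(toy_vocab))  # 6 5
# every token maps to a valid row index of an embedding matrix
print(all(0 <= toy_word2idx[w] < len(toy_vocab) for w in toy_words))  # True
```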
For simplicity, we use 100-dimensional embeddings.
We keep the window small (2): we consider 2 words to the left and 2 words to the right of the center word.
In addition, we randomly draw 4 negative samples for every neighbor of a center word.
EMBED_DIM = 100
WINDOW_SIZE = 2
NEG_SAMPLE_RATIO = 4
We initialize two sets of weights for each word -
- for when it is the center word
- for when it is in the context but not the center word
W_center = np.random.rand(VOCAB_SIZE, EMBED_DIM)
W_cntxt = np.random.rand(VOCAB_SIZE, EMBED_DIM)
The forward pass of our network uses the sigmoid function -
def sigmoid(x):
return 1 / (1 + np.exp(-x))
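A quick check of the sigmoid's key properties: σ(0) = 0.5, outputs squashed into (0, 1), and the identity σ(-x) = 1 - σ(x), which underlies the negative-sampling term of the loss:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))           # 0.5
print(sigmoid(10) > 0.999)  # True
print(sigmoid(-10) < 0.001) # True
# sigmoid(-x) == 1 - sigmoid(x)
print(np.isclose(sigmoid(-3), 1 - sigmoid(3)))  # True
```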
We compute the loss for one (center, context) pair with $k$ negative samples as

$$L_{CE} = -\left[\log\sigma(c_{pos}\cdot w) + \sum_{i=1}^{k}\log\sigma(-c_{neg_i}\cdot w)\right]$$

Finally, we backpropagate the loss using this set of gradients -

$$\frac{\partial L_{CE}}{\partial c_{pos}} = \left(\sigma(c_{pos}\cdot w)-1\right)w$$

$$\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i}\cdot w)\,w$$

$$\frac{\partial L_{CE}}{\partial w} = \left(\sigma(c_{pos}\cdot w)-1\right)c_{pos} + \sum_{i=1}^{k}\sigma(c_{neg_i}\cdot w)\,c_{neg_i}$$
# train model
lr = 9e-5  # learning rate
num_iters = 30
for i in range(num_iters):
    iter_loss = 0
    for idx, word in enumerate(words):
        # positions inside the context window
        cidx = list(range(max(0, idx-WINDOW_SIZE), min(idx+WINDOW_SIZE+1, DOC_SIZE)))
        # find the vocabulary index for each word in the context
        cidx = [word2idx[words[j]] for j in cidx]
        # find the vocabulary index for the center word
        idx = word2idx[word]
        # candidate negatives: vocabulary words outside the window
        external = [word2idx[w] for w in vocab if word2idx[w] not in cidx]
        neighbors = cidx.copy()
        # remove the center word from the context to get its neighbors
        neighbors.remove(idx)
        center_v = W_center[idx]
        loss = 0
        for c in neighbors:
            cntxt_v = W_cntxt[c]
            # probability of cntxt being a true neighbor of the center word
            p = sigmoid(center_v @ cntxt_v)
            # sample negatives
            negs = random.sample(external, k=min(NEG_SAMPLE_RATIO, len(external)))
            for neg in negs:
                neg_v = W_cntxt[neg]
                # fold in the probability that each negative is NOT a neighbor
                p *= sigmoid(center_v @ (-neg_v))
            loss += -np.log(p)
            # backpropagate: update the positive context vector ...
            W_cntxt[c] = cntxt_v - lr * (sigmoid(center_v @ cntxt_v) - 1) * center_v
            # ... and accumulate the center-word gradient over the positive
            # context word and all negatives before applying it once
            grad_center = (sigmoid(center_v @ cntxt_v) - 1) * cntxt_v
            for neg in negs:
                neg_v = W_cntxt[neg]
                W_cntxt[neg] = neg_v - lr * sigmoid(neg_v @ center_v) * center_v
                grad_center += sigmoid(neg_v @ center_v) * neg_v
            W_center[idx] = center_v - lr * grad_center
        loss /= WINDOW_SIZE
        iter_loss += loss
    print(f"loss for iter {i+1}: {iter_loss:.2f}")
loss for iter 2: 80.54
loss for iter 3: 73.82
loss for iter 4: 59.39
loss for iter 5: 52.19
loss for iter 6: 43.23
loss for iter 7: 35.33
loss for iter 8: 24.53
loss for iter 9: 17.13
loss for iter 10: 7.98
loss for iter 11: 4.49
loss for iter 12: 3.04
loss for iter 13: 2.46
loss for iter 14: 2.40
loss for iter 15: 2.01
loss for iter 16: 2.92
loss for iter 17: 2.12
loss for iter 18: 1.97
loss for iter 19: 2.05
loss for iter 20: 1.99
loss for iter 21: 1.70
loss for iter 22: 1.57
loss for iter 23: 2.02
loss for iter 24: 1.94
loss for iter 25: 2.11
loss for iter 26: 1.58
loss for iter 27: 1.51
loss for iter 28: 1.56
loss for iter 29: 1.60
loss for iter 30: 1.58
Find the embedding for a sample word -
W_center[word2idx['money']]
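To compare learned embeddings, cosine similarity is the standard measure. A minimal sketch on random stand-in vectors (in practice you would pass in rows of `W_center`):

```python
import numpy as np

def cosine_sim(a, b):
    # cosine of the angle between two vectors, always in [-1, 1]
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
v1, v2 = rng.random(100), rng.random(100)
print(cosine_sim(v1, v1))                 # ~1.0 for identical vectors
print(-1.0 <= cosine_sim(v1, v2) <= 1.0)  # True
```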
Now a bit of visualization:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# use principal component analysis to reduce dimensionality: 100 -> 4
pca = PCA(n_components=4)
# create an inverse lookup table to find index from word
idx2word = {idx:word for word, idx in word2idx.items()}
def pca4to2(p4):
p2 = np.zeros((p4.shape[0], 2))
# weighted: give more weights to first 2 components
p2[:,0] = 0.75*p4[:,0] + 0.25*p4[:,3]
p2[:,1] = 0.75*p4[:,1] + 0.25*p4[:,2]
return p2
count = 15
# select random words to test more critically
idxs = np.random.randint(VOCAB_SIZE, size=count)
testing = W_center[idxs].copy()
# find the corresponding words to label points in the figure
# (use a new name so the corpus-level `words` list is not overwritten)
sample_words = [idx2word[i] for i in idxs]
# decrease dimensionality for 2D visualization: 4 -> 2
projected_points = pca4to2(pca.fit_transform(testing))
plt.figure(figsize=(12, 8))
plt.scatter(projected_points[:,0], projected_points[:,1])
for txt, point in zip(sample_words, projected_points):
    plt.annotate(f'{txt}', xy=(point[0], point[1]), xytext=(3, 3), textcoords='offset points')
GloVe
GloVe (Global Vectors for Word Representation) is a word embedding model that represents words as dense vectors based on their co-occurrence patterns in a text corpus. It captures semantic and syntactic relationships between words, allowing words with similar meanings or contexts to be mapped to nearby points in a vector space.
This time, we follow the GloVe paper.
# we initialize each element of the co-occurrence matrix with 1
# so the log taken during cost estimation is always defined
co_matrix = np.full((VOCAB_SIZE, VOCAB_SIZE), 1, dtype=np.float32)
# fill the matrix
for idx, word in enumerate(words):
    cidx = list(range(max(0, idx-WINDOW_SIZE), min(idx+WINDOW_SIZE+1, DOC_SIZE)))
    cidx = [word2idx[words[j]] for j in cidx]
    idx = word2idx[word]
    neighbors = cidx.copy()
    neighbors.remove(idx)
    # increase the count of each (center, neighbor) entry by 1
    co_matrix[idx, neighbors] += 1
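The counting logic on a toy four-token sequence (window of 1, counts initialized at 1 so the log in the cost is always defined):

```python
import numpy as np

toy = ["a", "b", "a", "c"]
v = sorted(set(toy))                   # ['a', 'b', 'c']
w2i = {w: i for i, w in enumerate(v)}
C = np.ones((len(v), len(v)), dtype=np.float32)
for pos, word in enumerate(toy):
    for ctx in range(max(0, pos - 1), min(pos + 2, len(toy))):
        if ctx != pos:
            # bump the (center, neighbor) cell
            C[w2i[word], w2i[toy[ctx]]] += 1
print(C)
# "a" and "b" are adjacent twice, so C[a][b] == C[b][a] == 1 + 2 == 3
```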
Following the proposed model, for each word in our vocabulary we initialize two embedding matrices and two bias vectors -
- for when it is the center word
- for when it is in the context but is not the center word
We keep the same embedding dimension (100).
W_w = np.random.rand(VOCAB_SIZE, EMBED_DIM)
W_c = np.random.rand(VOCAB_SIZE, EMBED_DIM)
b_w = np.random.rand(VOCAB_SIZE)
b_c = np.random.rand(VOCAB_SIZE)
Finally, we adopt the paper's suggested weighting function -
def weight_function(x, x_max=100, alpha=0.75):
return (x/x_max)**alpha if x < x_max else 1
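With the defaults x_max = 100 and alpha = 0.75, rare co-occurrences are down-weighted and frequent ones are capped at 1:

```python
def weight_function(x, x_max=100, alpha=0.75):
    return (x/x_max)**alpha if x < x_max else 1

print(round(weight_function(1), 4))  # 0.0316: rare pair, heavily down-weighted
print(weight_function(100))          # 1: cap reached
print(weight_function(500))          # 1: very frequent pairs get no extra weight
```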
# train
lr = 2e-3
num_iters = 30
for it in range(num_iters):
    J = 0
    for i in range(VOCAB_SIZE):
        for j in range(VOCAB_SIZE):
            # forward propagation: weighted squared error against log co-occurrence
            f = weight_function(co_matrix[i][j])
            diff = W_w[i] @ W_c[j] + b_w[i] + b_c[j] - np.log(co_matrix[i][j])
            J += f * diff**2
            # find gradients
            dJ_dW_w = f * 2 * diff * W_c[j]
            dJ_dW_c = f * 2 * diff * W_w[i]
            dJ_db = f * 2 * diff
            # backpropagate
            W_w[i] = W_w[i] - lr*dJ_dW_w
            W_c[j] = W_c[j] - lr*dJ_dW_c
            b_w[i] = b_w[i] - lr*dJ_db
            b_c[j] = b_c[j] - lr*dJ_db
    print(f"Average cost for iter {it + 1}: {J/VOCAB_SIZE}")
BERT
BERT, or Bidirectional Encoder Representations from Transformers, is designed to pre-train bidirectional representations from unlabelled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. It was introduced in this paper.
BERT Architecture with 12 Encoder Blocks
Unlike word2vec or GloVe, which are context-free models that generate a single embedding for each word in the vocabulary (so the word 'deposit' gets the same representation in 'sand deposit' and 'bank deposit'), contextual models like BERT generate a representation of each word based on the other words in the sentence, capturing word relationships bidirectionally.
There are four pre-trained versions of BERT, two model scales each available in a cased and an uncased variant:
1. BERT-Base (Cased / Uncased): 12 layers, 768 hidden units, 12 attention heads, 110M parameters
2. BERT-Large (Cased / Uncased): 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
BERT relies on a Transformer (the attention mechanism that learns contextual relationships between words in a text). A basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the task. Since BERT’s goal is to generate a language representation model, it only needs the encoder part. The input to the encoder for BERT is a sequence of tokens, which are first converted into vectors and then processed in the neural network.

The input embedding is the sum of three embeddings:
1. Token embeddings: A [CLS] token is added at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token, allowing the encoder to distinguish between sentences.
3. Position embeddings: A positional embedding is added to each token to indicate its position in the sequence.
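The way the three embeddings combine is just an element-wise sum. A toy-sized numpy sketch, with random tables standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MAX_LEN, DIM = 10, 8, 4
tok_emb = rng.random((VOCAB, DIM))    # one row per token id
seg_emb = rng.random((2, DIM))        # sentence A vs. sentence B
pos_emb = rng.random((MAX_LEN, DIM))  # one row per position

token_ids = [2, 5, 7]   # e.g. [CLS], a word, [SEP]
segments = [0, 0, 0]    # all from sentence A
x = tok_emb[token_ids] + seg_emb[segments] + pos_emb[:len(token_ids)]
print(x.shape)  # (3, 4): one combined vector per input token
```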
Pre-Training Tasks
- Masked Language Modelling
- Next Sentence Prediction
Language modelling traditionally predicts the next word given a sequence of words. In MLM (masked language modelling), instead of predicting every next token, a percentage of the input tokens are masked at random (denoted by [MASK]) and only those masked tokens are predicted.
Of the tokens selected for masking, about 80% are actually replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are left unchanged.
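The 80/10/10 rule for a token selected for masking can be sketched as follows (plain strings, hypothetical toy vocabulary):

```python
import random

def corrupt(token, vocab, rng):
    r = rng.random()
    if r < 0.8:
        return '[MASK]'           # 80%: replace with the mask token
    elif r < 0.9:
        return rng.choice(vocab)  # 10%: replace with a random token
    return token                  # 10%: leave unchanged

rng = random.Random(0)
vocab = ['cat', 'dog', 'hat']
outcomes = [corrupt('sat', vocab, rng) for _ in range(10000)]
frac_masked = outcomes.count('[MASK]') / len(outcomes)
print(round(frac_masked, 2))  # close to 0.8
```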
Next Sentence Prediction is a binary classification task: given a pair of sentences, the model predicts whether the second sentence actually follows the first. This is easy to set up for a monolingual corpus like ours and helps the model learn sentence-level relationships.

Implementation
Here, however, we use a pre-trained BERT model and tokenizer from the transformers library, since training the model from scratch requires substantial computational resources and time.
# Import required libraries
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertTokenizer, BertModel
LEARNING_RATE = 2e-5
MAX_ITERATIONS = 20
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load pre-trained BERT model
model = BertModel.from_pretrained('bert-base-uncased')
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
# Since the BERT tokenizer works on sentences, we join the `words` list with spaces
sentence = " ".join(words).lower()
# Tokenize the sentence and get input IDs and attention masks
inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True) # pt --> pytorch tensor
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
# A classifier on top of the embeddings
class BertClassifier(nn.Module):
def __init__(self, bert_model):
super(BertClassifier, self).__init__()
self.bert = bert_model
self.linear = nn.Linear(bert_model.config.hidden_size, 2) # Assuming binary classification
def forward(self, input_ids, attention_mask):
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
cls_output = outputs.last_hidden_state[:, 0, :] # CLS token's output
linear_output = self.linear(cls_output)
return linear_output
# Initialize the classifier
classifier = BertClassifier(model)
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=LEARNING_RATE)
target = torch.tensor([1]) # dummy label for demonstration; batch size of 1
# Training loop
for iteration in range(MAX_ITERATIONS):
classifier.train()
# Zero gradients
optimizer.zero_grad()
# Forward pass
outputs = classifier(input_ids, attention_mask)
# Compute loss
loss = criterion(outputs, target)
# Backward pass and optimization
loss.backward()
optimizer.step()
# Print loss
print(f'Iteration {iteration + 1}/{MAX_ITERATIONS}, Loss: {loss.item()}')
Iteration 2/20, Loss: 0.32331499457359314
Iteration 3/20, Loss: 0.12396910041570663
Iteration 4/20, Loss: 0.037976574152708054
Iteration 5/20, Loss: 0.016252057626843452
Iteration 6/20, Loss: 0.013878156431019306
Iteration 7/20, Loss: 0.0051431492902338505
Iteration 8/20, Loss: 0.0026047846768051386
Iteration 9/20, Loss: 0.00357310613617301
Iteration 10/20, Loss: 0.00220282468944788
Iteration 11/20, Loss: 0.0022082962095737457
Iteration 12/20, Loss: 0.0019396792631596327
Iteration 13/20, Loss: 0.001653733546845615
Iteration 14/20, Loss: 0.0013194911880418658
Iteration 15/20, Loss: 0.001127441762946546
Iteration 16/20, Loss: 0.0014465117128565907
Iteration 17/20, Loss: 0.0009894242975860834
Iteration 18/20, Loss: 0.001303895260207355
Iteration 19/20, Loss: 0.0009843033039942384
Iteration 20/20, Loss: 0.0012832987122237682
# Set the model to evaluation mode
model.eval()
# Get embedding for a token
token_to_embed_1 = 'terrorist'
token_to_embed_2 = 'president'
# Tokenize the word
tokenized_word_1 = tokenizer(token_to_embed_1,return_tensors='pt')
tokenized_word_2 = tokenizer(token_to_embed_2,return_tensors='pt')
# Forward pass through the BERT model
with torch.no_grad():
output_1 = model(**tokenized_word_1)
output_2 = model(**tokenized_word_2)
# Extract embedding
token_embedding_1 = output_1.last_hidden_state.mean(dim=1) # mean of all token embeddings
token_embedding_2 = output_2.last_hidden_state.mean(dim=1) # mean of all token embeddings
print(f"Embedding for the word {token_to_embed_1} is: ")
print(token_embedding_1)
# compute normalized cosine similarity (a raw dot product would be unnormalized)
print("cosine similarity is:")
print(torch.nn.functional.cosine_similarity(token_embedding_1, token_embedding_2))
tensor([[ 1.7374e-02, -1.6452e-01, -5.7014e-01, -7.3958e-03, -6.2166e-02,
-4.3628e-01, 4.7165e-01, 1.9113e-01, -1.1920e-01, -1.7361e-01,
4.7690e-02, -3.2405e-03, -8.4639e-02, 9.2072e-02, -3.7923e-01,
-2.8636e-01, -1.3807e-01, 4.1172e-02, 7.0807e-02, 1.3038e-01,
-1.2498e-01, -1.5599e-01, -1.6036e-02, -1.1918e-01, 5.4789e-02,
3.0128e-01, -3.6652e-01, 1.3710e-01, -3.2896e-01, 2.1204e-01,
-1.7523e-01, -6.0807e-02, 1.4287e-02, 2.3996e-01, 2.4385e-01,
3.9706e-02, 6.9867e-02, -1.2082e-01, -4.2180e-01, -2.0469e-02,
-1.3474e-01, -2.0092e-01, 2.5502e-01, 1.1924e-01, -1.2062e-01,
-2.6600e-01, -2.1552e-01, 8.4123e-03, -1.6412e-01, -3.1441e-02,
-3.5701e-01, 4.4373e-01, -1.7808e-01, 1.7894e-01, 1.7257e-01,
1.0512e-01, 2.0778e-02, -7.6860e-04, 7.3791e-02, -2.1187e-01,
-3.9251e-02, 7.4863e-02, -1.2597e-01, -1.3604e-01, 5.8049e-01,
2.5568e-01, 3.9865e-01, 7.6369e-02, -6.8682e-01, -4.3632e-02,
-1.0416e-01, -4.3556e-01, 2.4590e-01, 3.7624e-02, 6.0529e-01,
-1.9854e-01, -1.3737e-01, 2.4818e-01, -1.6765e-01, -4.9088e-02,
3.1055e-01, 4.1254e-01, 4.2589e-02, 5.4121e-01, -2.0799e-01,
1.8512e-01, -3.6973e-01, -5.4129e-02, -5.0890e-01, 4.6086e-01,
1.8898e-01, 2.0085e-01, 2.5830e-01, 1.4630e-01, 2.3012e-02,
4.4696e-01, -2.8692e-01, 3.5843e-03, -2.2246e-01, -4.6079e-01,
-8.3254e-02, -1.9580e-01, -2.3790e-01, 3.7847e-01, 4.8426e-02,
1.6665e-01, -2.6331e-02, -1.3410e-02, -7.3453e-02, -6.2583e-01,
5.1965e-01, -2.0439e-01, 7.1314e-02, -2.5241e-01, -1.7771e-01,
6.9408e-01, 3.1573e-01, 4.9768e-02, 3.4828e-01, 2.1807e-01,
-1.8289e-01, -5.0266e-01, -1.9424e-01, 6.9391e-01, 7.2278e-02,
2.4239e-01, -7.0995e-02, -3.3472e-01, -6.3998e-03, -2.5730e-01,
6.1839e-02, 5.7214e-01, 1.0169e-01, 3.6946e-01, -2.1977e-01,
2.3836e-01, 1.1300e-01, 9.7882e-02, -2.8541e-01, 3.7439e-02,
-3.6598e-01, 1.2230e-02, -7.8547e-01, -4.4632e-01, 3.5109e-01,
1.4650e-03, 2.3460e-01, 2.2567e-02, 4.5791e-01, 8.9199e-02,
3.2520e-01, 1.0676e-01, -2.5990e-01, -2.3131e-01, -2.0538e-01,
-5.4250e-02, -1.1687e-02, 1.4347e-01, 2.9123e-01, 7.3830e-01,
2.4275e-01, 1.3773e-01, -6.8173e-02, -1.9949e-02, -4.4236e-01,
2.8279e-01, -3.2199e-01, 8.3012e-01, 3.6534e-01, -8.1679e-03,
-6.2365e-01, -1.3127e-01, 4.1455e-01, 4.3910e-01, -1.1335e-01,
1.7755e-01, 2.0580e-01, 5.8238e-02, 1.4792e-01, -1.6017e-01,
-2.8908e+00, 3.3250e-01, -2.2279e-02, 5.1337e-02, 3.9644e-01,
-1.8556e-01, -1.1301e-01, -2.2297e-01, 1.1938e-01, -2.1306e-01,
-3.2057e-01, -4.6897e-02, -1.8914e-01, 2.8036e-01, 3.9287e-01,
-2.9726e-02, 7.6915e-03, -5.2280e-01, 2.2415e-01, 2.7073e-02,
7.2728e-02, -1.3288e-01, 1.8475e-01, 3.3101e-01, -3.0454e-01,
7.2411e-01, 4.6715e-01, -2.7659e-01, 1.5219e-01, -5.8501e-02,
-5.5993e-01, 2.8742e-01, 2.0288e-02, -2.4110e-02, 2.8136e-01,
-2.4255e-01, 1.9112e-02, -8.8211e-02, -5.9576e-01, -1.6185e-01,
1.4276e-01, -2.6153e-01, -5.3152e-01, 1.8835e-01, -2.3019e-01,
-2.4095e-01, 9.4526e-02, 6.6041e-02, 6.7397e-02, -3.5743e-01,
2.3702e-01, -1.7415e-01, 4.1808e-02, -1.1598e-01, -7.7135e-01,
2.0321e-01, -1.5089e-01, -2.4067e-01, -3.1895e-02, -3.6711e-01,
-2.0804e-01, 3.1501e-01, 3.2660e-01, 1.2082e-01, -3.0045e-01,
-2.0734e-01, 4.9077e-01, 1.2268e-02, 6.2455e-01, -1.0781e-01,
-6.1569e-02, -2.8082e-01, -4.4145e-02, -4.5275e-01, 1.4175e-01,
-1.4871e-01, -1.1092e-01, -6.1441e-02, 2.1058e-01, -1.4943e-01,
2.8474e-01, 7.9897e-02, 6.3942e-01, -3.2015e-01, -5.0298e-01,
1.6845e-01, 4.2726e-01, 8.8949e-02, -8.9578e-02, 1.5914e-01,
2.6664e-02, 4.7797e-03, 2.7704e-01, -8.6888e-01, 2.2669e-01,
-1.9772e-01, 3.2692e-01, 3.3523e-01, 2.5474e-01, -2.2578e-01,
9.4904e-02, 1.5555e-02, -2.4100e-01, 2.5706e-01, 1.8509e-01,
-4.1456e-01, -4.7500e-01, -2.3337e-01, 2.5100e-01, -3.9031e-01,
-1.8198e-01, 1.1013e-01, -1.4849e-01, 2.0227e-01, 1.0688e-01,
-1.4474e-01, 2.1827e-01, 3.1942e-01, -1.7299e-01, -2.8821e-01,
-2.2286e-01, 1.2801e-01, -2.8908e-01, -2.0557e-01, 1.4127e-01,
3.8739e-01, 1.9482e-01, 1.3820e-01, -1.7439e+00, -2.0959e-01,
3.1546e-01, -5.2282e-01, -1.2011e-01, -9.9153e-02, -1.4425e-01,
-1.6252e-01, -4.8081e-01, 1.1375e-01, 2.5212e-01, -3.3073e-01,
4.2043e-01, 1.5742e-01, 4.3158e-02, -5.4502e-01, 3.9188e-03,
-3.0648e-01, -5.5903e-01, 3.2359e-01, -2.8754e-02, -9.7245e-02,
3.7213e-01, -1.4245e-01, 2.6586e-01, 3.2699e-02, -1.2301e-01,
7.4300e-02, -4.5865e-01, -1.1042e-01, 4.8299e-02, -2.3259e-01,
-4.6253e-02, 1.7153e-01, 1.6015e-01, -1.1401e-01, 3.2382e-01,
1.1283e-01, 9.6822e-01, 9.8066e-02, 3.5814e-01, -3.1002e-02,
3.9688e-02, 2.5047e-01, 1.7072e-01, -3.2220e-01, 3.2966e-01,
-3.4974e-01, -7.4430e-02, 3.8098e-01, 4.7787e-01, 5.3740e-02,
5.8294e-02, -9.4360e-02, 6.2948e-02, 2.2589e-02, 2.8809e-01,
1.8303e-01, -9.7257e-02, 1.1382e-01, 6.1926e-01, -9.8930e-02,
2.0666e-01, 2.8610e-01, -4.6880e-01, 3.2755e-03, -1.0650e-01,
-6.3853e-01, 3.9208e-02, 1.2266e-01, -2.6378e-01, -3.4516e-01,
-6.2254e-02, -9.0627e-01, -5.5940e-01, 3.8747e-01, -4.5787e-01,
3.5740e-02, 3.1509e-01, -5.0804e-01, 2.5322e-01, -2.1896e-02,
3.5997e-01, 4.9920e-02, 2.6515e-01, 2.0475e-02, 1.6092e-01,
4.5559e-02, 9.7092e-02, -1.7848e-01, -1.7692e-01, 2.9759e-01,
4.4902e-01, 7.1770e-02, 3.0802e-01, 4.9680e-02, 2.1703e-01,
-2.2044e-01, 4.3650e-02, 2.2700e-01, -9.8553e-01, -4.9050e-01,
-1.4076e-01, -1.4841e-02, 9.4821e-02, -2.5781e-01, -5.3490e-01,
-2.3523e-01, -2.5670e-01, -7.4060e-02, 2.9034e-01, -4.7304e-01,
5.5103e-01, 3.2994e-01, -8.0111e-02, -1.7448e-01, 5.3687e-02,
9.0574e-01, 1.4649e-01, 5.3517e-01, 7.8099e-02, 1.6347e-01,
-1.7786e-01, 3.3222e-01, 4.8870e-01, 2.5751e-03, 5.9294e-02,
-1.4275e-01, -7.6163e-02, -2.0194e-01, 7.3848e-02, 9.3533e-02,
-2.3505e-01, -2.9641e-01, -9.6473e-02, 1.1315e-01, 1.0917e-02,
3.8741e-01, 2.5681e-01, 5.0390e-01, 4.7539e-01, -3.4693e-02,
-4.7955e-01, 1.4327e-02, 1.7856e-01, 6.1175e-01, -2.5811e-01,
3.2652e-01, 6.3100e-01, 6.1104e-01, 1.4287e-01, -7.0911e-02,
2.6511e-01, -2.6766e-01, 4.9784e-02, -2.5562e-01, -1.6326e-01,
-3.0549e-01, -5.2391e-01, 5.3236e-02, -2.9136e-01, -4.9793e-02,
-2.9023e-01, 1.5197e-01, 7.5060e-02, 2.7989e-01, 2.1854e-01,
1.8411e-01, -6.4709e-01, 5.0442e-01, 4.0934e-01, -4.8464e-01,
-4.5851e-02, 1.4055e-01, -2.3368e-01, 2.5023e-01, -1.4681e-01,
-2.8516e-01, -7.5703e-02, 2.6399e-01, -6.2981e-02, -3.0975e-01,
3.4879e-02, 3.3440e-02, 5.4803e-01, -8.4437e-02, -2.6980e-01,
-4.5518e-01, 1.2045e-01, 7.0954e-01, 2.0876e-01, -2.1347e-01,
-3.2815e-01, -8.8190e-02, -4.2917e-02, -1.1838e-01, 1.9356e-01,
1.4475e-01, 1.9469e-01, 1.9725e-01, 2.5357e-01, -1.6775e-01,
-1.1067e-01, -4.6477e-02, 2.2638e-02, -4.3648e-02, -1.7693e-01,
4.3160e-01, -3.5970e-01, -2.3965e-01, 7.6310e-02, -3.7940e-01,
-4.6879e-01, -2.7393e-01, -2.5441e-01, 5.1971e-01, 3.2835e-01,
2.9298e-02, 2.7637e-01, -1.7388e-01, -7.0169e-01, -1.8746e-01,
9.1159e-02, 2.0937e-01, 5.1887e-02, -2.7277e-01, -2.8968e-01,
-2.2699e-01, -1.1926e-01, -1.6115e-01, 1.3262e-01, 8.3340e-01,
-1.3687e-02, -1.5976e-01, 5.5867e-01, 2.7761e-01, -3.8776e-01,
-1.0979e-01, -3.2263e-01, -1.9592e-01, 3.1763e-01, -4.5718e-01,
-4.8048e-01, -3.3771e-01, -1.1748e-01, 4.2848e-01, 3.0792e-01,
6.2228e-01, 4.7312e-01, 1.6552e-01, -3.7293e-01, 1.4558e-01,
-5.5195e-02, -2.5213e-01, -1.7276e-03, -7.7818e-01, -3.8730e-01,
9.5898e-03, 1.0205e-01, -1.6529e-01, -3.4595e-01, 4.7259e-01,
1.9848e-01, -7.2310e-04, 2.1165e-01, -2.5182e-02, -4.5804e-01,
1.0086e-01, 3.2469e-01, 2.8286e-01, -4.9294e-01, 1.4354e-02,
1.0865e-01, 2.4188e-01, -5.8536e-01, -1.2675e-02, 3.4785e-01,
-7.0521e-01, -6.4848e-01, -1.4943e-02, 3.4161e-01, 7.5072e-01,
5.4468e-02, -5.2427e-02, 4.0159e-01, -3.2333e-01, -3.5813e-01,
-6.2457e-02, -4.6571e-01, -4.4504e-01, 3.0531e-01, 1.4617e-01,
-4.1137e-02, 6.2343e-01, 1.5395e-02, 5.4916e-01, -5.5378e-01,
-1.3170e-01, -6.5212e-01, 2.8995e-01, -1.2990e-01, 1.0328e-01,
-1.1949e-01, -3.0469e-01, -6.2084e-02, 2.5031e-02, 2.8510e-01,
-4.3567e-01, 1.6961e-01, -1.7390e-01, -6.2654e-01, -3.4774e-02,
-8.7120e-02, 5.2153e-01, -7.2084e-01, 7.6163e-02, 8.0006e-02,
-9.6215e-01, 1.0568e-01, 4.4655e-01, -9.1746e-02, -1.1843e-01,
5.0608e-01, 2.6498e-01, 3.9108e-01, 4.9756e-01, -4.3559e-01,
-2.3057e-01, -1.0404e-01, 3.7336e-01, 2.0653e-02, 1.6370e-01,
-1.0397e-01, 4.6485e-01, 8.2398e-02, -4.4004e-01, 4.5274e-01,
2.5663e-01, -3.3702e-01, -3.4331e-01, 1.3202e-01, -9.2540e-02,
2.6597e-02, -1.0421e-01, 2.1927e-02, 9.5780e-02, 1.5237e-01,
-7.0300e-02, 1.4973e-01, 1.3685e-01, 3.9456e-01, -1.6356e-01,
-1.1651e-01, 2.3131e-01, 9.1016e-02, -3.3953e-01, 1.4095e-01,
1.4021e-01, -3.3963e-01, -8.3755e-01, 9.3300e-02, -1.8119e-01,
1.4321e-01, -1.5538e-01, 3.9440e-01, 3.0387e-01, -3.7832e-01,
-4.2179e-01, -3.4581e-01, -1.1036e-01, 5.2689e-01, -5.7472e-02,
-2.7016e-01, 1.3547e-01, 1.1869e-01, 1.7344e-01, -2.2004e-01,
2.0525e-01, -6.9932e-01, -1.5563e-01, -1.2096e-01, -2.7858e-03,
-2.5921e-01, -3.9257e-01, 8.4317e-01, -3.7869e-01, -2.3204e-01,
-2.5009e-01, 2.8395e-01, -1.6173e-01, 7.5455e-01, 3.8367e-02,
-4.3744e-02, -2.0631e-01, 8.5053e-02, 4.3077e-02, 2.8687e-01,
-7.2215e-02, 8.1219e-02, 2.2306e-01, 4.1470e-01, 1.1718e-03,
-3.4961e-01, -1.3233e-01, -3.6323e-01, 3.5453e-02, 5.2420e-01,
2.6686e-01, -1.9821e-01, -3.3358e-01, -1.1295e-01, 1.0244e-01,
-1.0089e+00, 5.6049e-01, -9.6835e-02, 3.7474e-01, 5.8637e-02,
2.7519e-02, 2.9308e-01, -1.2999e-01, -7.7437e-02, -2.3723e-01,
3.6427e-01, 1.6328e-01, -2.0383e-01, 6.6237e-02, -4.7130e-01,
1.3019e-01, 1.1225e-01, -1.6424e-01, -3.6243e-01, 4.3517e-01,
-2.2682e-01, 1.8185e-02, 2.8673e-01, 1.1268e-02, 2.5277e-01,
9.9137e-02, -2.4715e-01, -2.7223e-01, 9.6292e-02, 1.1488e-01,
1.7338e-01, -5.2914e-01, -2.1420e+00, -4.7387e-01, -3.6633e-01,
7.5884e-02, 2.2852e-01, -5.6951e-01, 3.3331e-01, -2.6739e-02,
3.3701e-01, -3.2786e-01, 1.7956e-01, -5.6017e-01, -7.6565e-02,
-8.0955e-02, -8.4948e-02, -2.8519e-01]])
cosine similarity is:
tensor([[72.9747]])